feat(sandbox): switch device plugin to CDI injection mode #503

Open
elezar wants to merge 2 commits into main from feat/device-plugin-cdi-injection

elezar (Member) commented Mar 20, 2026

Summary

Configure the NVIDIA device plugin to use deviceListStrategy: cdi-cri so GPU devices are injected via direct CDI device requests in the CRI. Sandbox pods now only need nvidia.com/gpu: 1 in their resource limits — runtimeClassName is no longer set on GPU pods.

Related Issue

Related to #398

Changes

  • deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml: add deviceListStrategy: cdi-cri, cdi.nvidiaHookPath, and nvidiaDriverRoot: "/" to Helm values
  • crates/openshell-server/src/sandbox/mod.rs: remove runtimeClassName insertion for GPU pods in both sandbox_template_to_k8s() and inject_pod_template(); add unit test asserting CDI path sets no runtimeClassName
  • architecture/gateway-single-node.md: update GPU Enablement section to document CDI injection mode
  • .agents/skills/debug-openshell-cluster/SKILL.md: add Step 8 with CDI-specific diagnostics (nvidia-ctk cdi list, device plugin logs, CDI spec files)
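For reviewers, the behavioural change in sandbox/mod.rs can be sketched roughly as below. This is a simplified illustration, not the code in this PR: `PodSpecSketch` and `build_pod_spec` are hypothetical stand-ins (the real functions are sandbox_template_to_k8s() and inject_pod_template(), whose signatures are not shown here), and the pod spec is modelled with plain std types rather than Kubernetes API types. It only shows the intended shape: a GPU pod carries the nvidia.com/gpu limit and leaves runtimeClassName unset.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a pod spec (not the project's real type).
#[derive(Debug, Default)]
struct PodSpecSketch {
    /// Stays `None` under CDI injection; older behaviour would have set it for GPU pods.
    runtime_class_name: Option<String>,
    /// Container resource limits, keyed by resource name.
    resource_limits: BTreeMap<String, String>,
}

/// Hypothetical helper: build the sketch spec for a sandbox pod.
/// With CDI injection, requesting a GPU only adds the resource limit.
fn build_pod_spec(gpu: bool) -> PodSpecSketch {
    let mut spec = PodSpecSketch::default();
    if gpu {
        spec.resource_limits
            .insert("nvidia.com/gpu".to_string(), "1".to_string());
    }
    // Deliberately no runtime_class_name assignment here.
    spec
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Mirrors the intent of the unit test described above.
    #[test]
    fn gpu_pod_sets_no_runtime_class_name() {
        let spec = build_pod_spec(true);
        assert_eq!(
            spec.resource_limits.get("nvidia.com/gpu").map(String::as_str),
            Some("1")
        );
        assert!(spec.runtime_class_name.is_none());
    }
}
```

The point of the sketch is that GPU selection is expressed only through the resource limit; device injection is left to the device plugin and the CRI.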

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (not applicable — CDI-specific assertion not observable from inside the sandbox)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated

elezar self-assigned this Mar 20, 2026
elezar force-pushed the feat/device-plugin-cdi-injection branch from 6840a21 to 6682d7d March 20, 2026 17:00
elezar added 2 commits March 20, 2026 21:47
Configure the NVIDIA device plugin to use deviceListStrategy=cdi-cri so
that GPU devices are injected via direct CDI device requests in the CRI.
Sandbox pods now only require the nvidia.com/gpu resource request —
runtimeClassName is no longer set on GPU pods.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
elezar force-pushed the feat/device-plugin-cdi-injection branch from 6682d7d to 9c39785 March 20, 2026 20:47
elezar marked this pull request as ready for review March 20, 2026 20:47
elezar requested a review from a team as a code owner March 20, 2026 20:47